    Three-Dimensional Phylogeny Explorer: Distinguishing paralogs, lateral transfer, and violation of "molecular clock" assumption with 3D visualization

    Background: Construction and interpretation of phylogenetic trees have been a major research topic for understanding the evolution of genes. Increases in sequence data and complexity are creating a need for more powerful and insightful tree visualization tools. Results: We have developed 3D Phylogeny Explorer (3DPE), a novel phylogeny tree viewer that maps trees onto three spatial axes (species on the X-axis; paralogs on Z; evolutionary distance on Y), enabling one to distinguish at a glance evolutionary features such as speciation; gene duplication and paralog evolution; lateral gene transfer; and violation of the "molecular clock" assumption. Users can input any tree on the online 3DPE, then rotate, scroll, rescale, and explore it interactively as "live" 3D views. All objects in 3DPE are clickable to display subtrees, connectivity path highlighting, sequence alignments, and gene summary views, among others. To illustrate the value of this visualization approach for microbial genomes, we also generated 3D phylogeny analyses for all clusters from the public COG database. We constructed tree views using well-established methods and graph algorithms. We used Scientific Python to generate VRML2 3D views viewable in any web browser. Conclusion: 3DPE provides a novel method for projecting phylogenetic trees into 3D space, together with a web-based implementation with live 3D features, applied here to reconstruct phylogenetic trees for the COG database.

    Reconciling taxonomy and phylogenetic inference: formalism and algorithms for describing discord and inferring taxonomic roots

    Although taxonomy is often used informally to evaluate the results of phylogenetic inference and find the root of phylogenetic trees, algorithmic methods to do so are lacking. In this paper we formalize these procedures and develop algorithms to solve the relevant problems. In particular, we introduce a new algorithm that solves a "subcoloring" problem for expressing the difference between the taxonomy and phylogeny at a given rank. This algorithm improves upon the current best algorithm in terms of asymptotic complexity for the parameter regime of interest; we also describe a branch-and-bound algorithm that saves orders of magnitude in computation on real data sets. We also develop a formalism and an algorithm for rooting phylogenetic trees according to a taxonomy. All of these algorithms are implemented in freely available software. Comment: Version submitted to Algorithms for Molecular Biology. A number of fixes from previous versio

    On strongly chordal graphs that are not leaf powers

    A common task in phylogenetics is to find an evolutionary tree representing proximity relationships between species. This motivates the notion of leaf powers: a graph G = (V, E) is a leaf power if there exist a tree T with leaf set V and a threshold k such that uv is an edge if and only if the distance between u and v in T is at most k. Characterizing leaf powers is a challenging open problem, along with determining the complexity of their recognition. This is in part because few graphs are known not to be leaf powers, as such graphs are difficult to construct. Recently, Nevries and Rosenke asked if leaf powers could be characterized by strong chordality and a finite set of forbidden subgraphs. In this paper, we provide a negative answer to this question by exhibiting an infinite family $\mathcal{G}$ of (minimal) strongly chordal graphs that are not leaf powers. In the process, we establish a connection between leaf powers, alternating cycles and quartet compatibility. We also show that deciding if a chordal graph is $\mathcal{G}$-free is NP-complete, which may provide insight into the complexity of the leaf power recognition problem.
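
The leaf power definition above can be made concrete with a small sketch. The code below is only an illustration of the definition, not the paper's method: given a candidate tree T and a threshold k, it builds the graph on the leaves of T in which two leaves are adjacent exactly when their tree distance is at most k, and compares it with a given edge list. The function names and the adjacency-dict representation are assumptions made for this example. Note that it only verifies a proposed tree; finding such a tree (recognition) is the hard problem the abstract refers to.

```python
from collections import deque
from itertools import combinations


def tree_distances(tree, source):
    """Distances (in edges) from `source` to every node, via breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in tree[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def leaf_power_edges(tree, k):
    """Edge set of the k-leaf power of `tree` (an adjacency dict)."""
    # Leaves are the nodes with exactly one neighbor.
    leaves = sorted(v for v, nbrs in tree.items() if len(nbrs) == 1)
    dist = {leaf: tree_distances(tree, leaf) for leaf in leaves}
    # Two leaves are adjacent iff their distance in the tree is at most k.
    return {frozenset((u, v)) for u, v in combinations(leaves, 2) if dist[u][v] <= k}


def is_leaf_root(tree, k, graph_edges):
    """True if `tree` with threshold k yields exactly `graph_edges`.

    Assumes the graph's vertex set equals the leaf set of `tree`.
    """
    return leaf_power_edges(tree, k) == {frozenset(e) for e in graph_edges}


# Hypothetical example: leaves a, b hang off x; leaves c, d hang off y; x and y are joined.
example_tree = {
    "a": ["x"], "b": ["x"], "c": ["y"], "d": ["y"],
    "x": ["a", "b", "y"], "y": ["c", "d", "x"],
}
```

With `k = 2` the example tree yields two disjoint edges (a-b and c-d); with `k = 3` every pair of leaves is adjacent.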

    IsoBase: a database of functionally related proteins across PPI networks

    We describe IsoBase, a database identifying functionally related proteins across five major eukaryotic model organisms: Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans, Mus musculus and Homo sapiens. Nearly all existing algorithms for orthology detection are based on sequence comparison. Although these have been successful in orthology prediction to some extent, we seek to go beyond these methods by integrating sequence data and protein–protein interaction (PPI) networks to help identify true functionally related proteins. With that motivation, we introduce IsoBase, the first publicly available ortholog database that focuses on functionally related proteins. The groupings were computed using the IsoRankN algorithm, which uses spectral methods to combine sequence and PPI data and produce clusters of functionally related proteins. These clusters compare favorably with those from existing approaches: proteins within an IsoBase cluster are more likely to share similar Gene Ontology (GO) annotation. A total of 48 120 proteins were clustered into 12 693 functionally related groups. The IsoBase database may be browsed for functionally related proteins across two or more species and may also be queried by accession numbers, species-specific identifiers, gene name or keyword. The database is freely available for download at http://isobase.csail.mit.edu/. Funding: National Institute of General Medical Sciences (U.S.) (Grant Number 1R01GM081871); Fannie and John Hertz Foundation; National Science Foundation (U.S.) (NSF MSPRF); National Science Council of Taiwan (NSC99-2218-E-007-010); National Institutes of Health (U.S.) (1R01GM081871)

    Haemophilus Influenzae Microarrays: Virulence and Vaccines

    In 1995 the genome sequence of the Haemophilus influenzae KW20 (Rd) strain was published, the first available for a free-living organism. The genome has been invaluable in global strategies to identify certain virulence-related genes, e.g. those involved in LPS synthesis, and also essential genes, but there is a paucity of whole-genome transcriptome studies. We have now constructed a whole-genome array consisting of genes from Rd, additional genes identified in other strains of H. influenzae and controls (from eukaryotic sources and other bacteria). We intend to use this array in studies aimed at understanding the bacterium’s basic metabolism and its response to changing environments; deciphering global regulatory networks (by comparison of wild-type and mutant strains); and identifying genes expressed in vivo. The use of H. influenzae DNA arrays combined with proteomic approaches will enhance our understanding of the metabolism and virulence of the organism. Additionally, sequencing of the genome of a non-typable H. influenzae strain is in progress. The sequence from this isolate will be invaluable not only in identifying potential novel antibiotic targets and putative vaccine candidates but also in the design of a microarray for genome-typing purposes.

    Statistically validated networks in bipartite complex systems

    Many complex systems present an intrinsic bipartite nature and are often described and modeled in terms of networks [1-5]. Examples include movies and actors [1, 2, 4], authors and scientific papers [6-9], email accounts and emails [10], plants and animals that pollinate them [11, 12]. Bipartite networks are often very heterogeneous in the number of relationships that the elements of one set establish with the elements of the other set. When one constructs a projected network with nodes from only one set, the system heterogeneity makes it very difficult to identify preferential links between the elements. Here we introduce an unsupervised method to statistically validate each link of the projected network against a null hypothesis taking into account the heterogeneity of the system. We apply our method to three different systems, namely the set of clusters of orthologous genes (COG) in completely sequenced genomes [13, 14], a set of daily returns of 500 US financial stocks, and the set of world movies of the IMDb database [15]. In all these systems, which differ in both size and level of heterogeneity, we find that our method is able to detect network structures which are informative about the system and are not simply an expression of its heterogeneity. Specifically, our method (i) identifies the preferential relationships between the elements, (ii) naturally highlights the clustered structure of investigated systems, and (iii) allows links to be classified according to the type of statistically validated relationships between the connected nodes. Comment: Main text: 13 pages, 3 figures, and 1 Table. Supplementary information: 15 pages, 3 figures, and 2 Table
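
To make the idea of validating projected links against a heterogeneity-aware null concrete, here is a minimal sketch. The abstract does not specify the test, so two choices below are assumptions made for illustration: a hypergeometric null (how likely is an overlap at least this large if two elements chose their partners at random, given their degrees?) and a Bonferroni correction over all tested pairs. The function names, the `membership` layout, and the `alpha` default are likewise hypothetical.

```python
from itertools import combinations
from math import comb


def hypergeom_sf(n_shared, n_total, deg_a, deg_b):
    """P(X >= n_shared), where X is the overlap of two random subsets of
    sizes deg_a and deg_b drawn from a pool of n_total items."""
    upper = min(deg_a, deg_b)
    favorable = sum(
        comb(deg_a, x) * comb(n_total - deg_a, deg_b - x)
        for x in range(n_shared, upper + 1)
    )
    return favorable / comb(n_total, deg_b)


def validated_links(membership, n_total, alpha=0.01):
    """Return {(a, b): p_value} for pairs whose co-occurrence is significant.

    `membership` maps each element of one node set to the set of partners
    it is linked to in the other node set (of size n_total).
    """
    pairs = list(combinations(sorted(membership), 2))
    if not pairs:
        return {}
    threshold = alpha / len(pairs)  # Bonferroni correction (an assumption here)
    links = {}
    for a, b in pairs:
        shared = len(membership[a] & membership[b])
        if shared == 0:
            continue  # no link in the projected network, nothing to validate
        p_value = hypergeom_sf(shared, n_total, len(membership[a]), len(membership[b]))
        if p_value < threshold:
            links[(a, b)] = p_value
    return links


# Hypothetical data: two elements sharing the same 10 of 100 partners, plus a hub.
example = {"g1": set(range(10)), "g2": set(range(10)), "hub": set(range(100))}
```

In this toy data the hub overlaps fully with both other elements, yet its links are not validated, because an element linked to everything overlaps with anything by chance. Only the g1-g2 link survives, which is the kind of degree-aware filtering the abstract describes.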

    ProOpDB: Prokaryotic Operon DataBase

    The Prokaryotic Operon DataBase (ProOpDB, http://operons.ibt.unam.mx/OperonPredictor) constitutes one of the most precise and complete repositories of operon predictions now available. Using our novel and highly accurate operon identification algorithm, we have predicted the operon structures of more than 1200 prokaryotic genomes. ProOpDB offers diverse alternatives by which a set of operon predictions can be retrieved including: (i) organism name, (ii) metabolic pathways, as defined by the KEGG database, (iii) gene orthology, as defined by the COG database, (iv) conserved protein domains, as defined by the Pfam database, (v) reference gene and (vi) reference operon, among others. In order to limit the operon output to non-redundant organisms, ProOpDB offers an efficient method to select the most representative organisms based on a precompiled phylogenetic distance matrix. In addition, the ProOpDB operon predictions are used directly as the input data of our Gene Context Tool to visualize their genomic context and retrieve the sequence of their corresponding 5′ regulatory regions, as well as the nucleotide or amino acid sequences of their genes.

    Pseudomonas Genome Database: facilitating user-friendly, comprehensive comparisons of microbial genomes

    Pseudomonas aeruginosa is a well-studied opportunistic pathogen that is particularly known for its intrinsic antimicrobial resistance, diverse metabolic capacity, and its ability to cause life-threatening infections in cystic fibrosis patients. The Pseudomonas Genome Database (http://www.pseudomonas.com) was originally developed as a resource for peer-reviewed, continually updated annotation for the Pseudomonas aeruginosa PAO1 reference strain genome. In order to facilitate cross-strain and cross-species genome comparisons with other Pseudomonas species of importance, we have now expanded the database capabilities to include all Pseudomonas species, and have developed or incorporated methods to facilitate high-quality comparative genomics. The database contains robust assessment of orthologs, a novel ortholog clustering method, and incorporates five views of the data at the sequence and annotation levels (Gbrowse, Mauve and custom views) to facilitate genome comparisons. A choice of simple and more flexible user-friendly Boolean search features allows researchers to search and compare annotations or sequences within or between genomes. Other features include more accurate protein subcellular localization predictions and a user-friendly, Boolean searchable log file of updates for the reference strain PAO1. This database aims to continue to provide a high-quality, annotated genome resource for the research community and is available under an open source license.

    Partial Homology Relations - Satisfiability in terms of Di-Cographs

    Directed cographs (di-cographs) play a crucial role in the reconstruction of evolutionary histories of genes based on homology relations, which are binary relations between genes. A variety of methods based on pairwise sequence comparisons can be used to infer such homology relations (e.g., orthology, paralogy, xenology). They are satisfiable if the relations can be explained by an event-labeled gene tree, i.e., they can simultaneously co-exist in an evolutionary history of the underlying genes. Every gene tree is equivalently interpreted as a so-called cotree that entirely encodes the structure of a di-cograph. Thus, satisfiable homology relations must necessarily form a di-cograph. The inferred homology relations might not cover each pair of genes and thus provide only partial knowledge of the full set of homology relations. Moreover, for particular pairs of genes, it might be known with a high degree of certainty that they are not orthologs (resp. paralogs, xenologs), which yields forbidden pairs of genes. Motivated by this observation, we characterize (partial) satisfiable homology relations with or without forbidden gene pairs, and provide a quadratic-time algorithm for their recognition and for the computation of a cotree that explains the given relations.
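
The cograph condition can be illustrated for the simplest case. The sketch below is not the quadratic-time algorithm from the abstract and does not handle directed, partial, or forbidden relations. It assumes a single symmetric relation (for example, orthology between gene pairs) given in full, and checks whether the resulting undirected graph is a cograph using the standard characterization: every induced subgraph on two or more vertices is disconnected either in the graph or in its complement. The function names and the adjacency-dict layout are assumptions made for this example.

```python
def components(vertices, adjacency):
    """Connected components of the subgraph induced by `vertices`."""
    remaining = set(vertices)
    result = []
    while remaining:
        start = remaining.pop()
        component = {start}
        stack = [start]
        while stack:
            node = stack.pop()
            # The intersection builds a new set, so shrinking `remaining` is safe.
            for neighbor in adjacency[node] & remaining:
                remaining.discard(neighbor)
                component.add(neighbor)
                stack.append(neighbor)
        result.append(component)
    return result


def is_cograph(vertices, adjacency):
    """True if the subgraph induced by `vertices` is a cograph.

    `adjacency` maps each vertex to the set of its neighbors.
    """
    vertices = set(vertices)
    if len(vertices) <= 1:
        return True
    parts = components(vertices, adjacency)
    if len(parts) == 1:
        # Connected: try splitting along the complement graph instead.
        complement = {v: vertices - adjacency[v] - {v} for v in vertices}
        parts = components(vertices, complement)
        if len(parts) == 1:
            return False  # connected and co-connected, so not a cograph
    return all(is_cograph(part, adjacency) for part in parts)


# Hypothetical relations on four genes: a path a-b-c-d and a star centered on a.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
star = {"a": {"b", "c", "d"}, "b": {"a"}, "c": {"a"}, "d": {"a"}}
```

The path on four vertices is the smallest graph that fails the test, while the star passes. The splits found by the recursion correspond to the inner nodes of a cotree, which is why a successful decomposition doubles as an explanation of the relation.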